3_RNAseq

Master Medical Biometry/Biostatistics, Introduction to Bioinformatics, Medizinische Fakultät Heidelberg


The purpose of these exercises is to introduce you to the post-processing of RNA-seq data, in this case the differential expression analysis.

Provide answers to questions that are marked with ‘Q’.

The Data


Leishmaniasis, is a disease caused by protozoan parasites of the genus Leishmania and spread by the bite of certain types of sandflies. Samples from infected and healthy individuals were sequenced. You can find the article of this study here.

  • The raw data (fastq files) was downloaded from the ENA (European Nucleotide Archive) database, under the project E-MTAB-4902.

These are the samples used:

Run Accession Type ID
ERR1523942 Normal Ctrl1
ERR1523943 Normal Ctrl2
ERR1523944 Normal Ctrl3
ERR1523945 Normal Ctrl4
ERR1523946 Normal Ctrl5
ERR1523947 Visceral Leishmaniasis VL1
ERR1523948 Visceral Leishmaniasis VL2
ERR1523949 Visceral Leishmaniasis VL3
  • These samples were mapped towards the human genome
  • Gene counts were generated with HTseq and stored in geneCounts.txt.

Download the file from here and unzip it.

Loading packages

There are many software packages for this, such as DESeq2 and edgeR. For today´s exercise we will use DESeq2, an R package that estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution. DESeq2 uses the raw read counts per gene as input.

Start R in your local computer.

Select your working directory (where you have the geneCounts.txt file) by clicking:

Session -> Set Working Directory -> Choose directory

This is a copy/paste tutorial so we can focus on the interpretation of the results; however it is important to understand what each line is doing, so if you have any questions, just raise your hand!

Load the following libraries:

library(DESeq2)
## Loading required package: S4Vectors
## Loading required package: stats4
## Loading required package: BiocGenerics
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, aperm, append, as.data.frame, basename, cbind,
##     colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find,
##     get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff, table,
##     tapply, union, unique, unsplit, which.max, which.min
## 
## Attaching package: 'S4Vectors'
## The following object is masked from 'package:utils':
## 
##     findMatches
## The following objects are masked from 'package:base':
## 
##     expand.grid, I, unname
## Loading required package: IRanges
## 
## Attaching package: 'IRanges'
## The following object is masked from 'package:grDevices':
## 
##     windows
## Loading required package: GenomicRanges
## Loading required package: GenomeInfoDb
## Loading required package: SummarizedExperiment
## Loading required package: MatrixGenerics
## Loading required package: matrixStats
## 
## Attaching package: 'MatrixGenerics'
## The following objects are masked from 'package:matrixStats':
## 
##     colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse,
##     colCounts, colCummaxs, colCummins, colCumprods, colCumsums,
##     colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs,
##     colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats,
##     colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds,
##     colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads,
##     colWeightedMeans, colWeightedMedians, colWeightedSds,
##     colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet,
##     rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods,
##     rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps,
##     rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins,
##     rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks,
##     rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars,
##     rowWeightedMads, rowWeightedMeans, rowWeightedMedians,
##     rowWeightedSds, rowWeightedVars
## Loading required package: Biobase
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
## 
## Attaching package: 'Biobase'
## The following object is masked from 'package:MatrixGenerics':
## 
##     rowMedians
## The following objects are masked from 'package:matrixStats':
## 
##     anyMissing, rowMedians
library(pheatmap)
library(RColorBrewer)
library(DT)
library(kableExtra)
library(EnhancedVolcano)
## Loading required package: ggplot2
## Loading required package: ggrepel

Let’s define the metadata for our samples, this will set the colors to be displayed in our graphs and define the grouping for the analysis:

# Defining the sample names 
sampleNames <- c("Ctrl1", "Ctrl2", "Ctrl3", "Ctrl4", "Ctrl5", "VL1", "VL2", "VL3")

# Defining the sample grouping
conditionNames <- data.frame(condition = c(rep("Ctrl",5), rep("VL",3)))

# Defining the Control and the Visceral Leishmaniasis group for plotting
annotation_col <- data.frame(Type = c(rep("Ctrl",5), rep("VL",3)))

# Defining the colors by group for plotting 
annotation_colors <- list(Type = c(Ctrl = "blue", 
                                   VL   = "red"))

rownames(annotation_col) = sampleNames

Now we load the gene counts:

# Reading the sample information 
setwd("C:/Users/marce/OneDrive/Desktop")

counts <- read.table("geneCounts.txt", 
                     header    = TRUE, 
                     row.names = 1, 
                     sep       = "\t", 
                     as.is     = TRUE)
colnames(counts) <- sampleNames
head(counts)
##                 Ctrl1 Ctrl2 Ctrl3 Ctrl4 Ctrl5   VL1   VL2   VL3
## ENSG00000000003    19    13    12     9    10     2     5     4
## ENSG00000000419   470   368   273   205   281   398   378   714
## ENSG00000000457   813   816   754   665   837   835   771   687
## ENSG00000000460   213   180   112   110    94   233   225   209
## ENSG00000000938 18215 28421 24508 24414 29163 28888 13267 24022
## ENSG00000000971    26    30    91    57    61    47   121    57
                Ctrl1 Ctrl2 Ctrl3 Ctrl4 Ctrl5   VL1   VL2   VL3
ENSG00000000003    19    13    12     9    10     2     5     4
ENSG00000000419   470   368   273   205   281   398   378   714
ENSG0000000457   813   816   754   665   837   835   771   687
ENSG00000000460   213   180   112   110    94   233   225   209
ENSG00000000938 18215 28421 24508 24414 29163 28888 13267 24022
ENSG00000000971    26    30    91    57    61    47   121    57

And construct the main data object. The DESeqDataSetFromMatrix function constructs a DESeq2 data set object integrating the counts, the samples description and the experimental design:

# Creating the DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData = counts, 
                              colData   = conditionNames, 
                              design    =~ condition)
## Warning in DESeqDataSet(se, design = design, ignoreRank): some variables in
## design formula are characters, converting to factors
head(dds)
## class: DESeqDataSet 
## dim: 6 8 
## metadata(1): version
## assays(1): counts
## rownames(6): ENSG00000000003 ENSG00000000419 ... ENSG00000000938
##   ENSG00000000971
## rowData names(0):
## colnames(8): Ctrl1 Ctrl2 ... VL2 VL3
## colData names(1): condition

Quality control and dimensional reduction

One of the first things that we need to do is to assess the overall similarity between the samples. We can create a PCA plot:

# Data transformation
rld <- rlogTransformation(dds)

# Ploting PCA
plotPCA(rld)
## using ntop=500 top features by variance

Or we can look at the Euclidean distance:

# Calculating distance among samples
sampleDist <- dist(t(assay(rld)))
sampleDistMtx <- as.matrix(sampleDist)
rownames(sampleDistMtx) <- sampleNames
colnames(sampleDistMtx) <- sampleNames

# Plotting the distances
# Select between Blues, Reds, Purples, Oranges or Greens
hmcol = colorRampPalette( rev(brewer.pal(9, "Purples")) )(100)

pheatmap(sampleDistMtx, 
#         cluster_rows      = sampleDist, 
#         cluster_cols      = sampleDist, 
         main              = "Sample distances", 
         fontsize_row      = 12, 
         annotation_col    = annotation_col, 
         annotation_colors = annotation_colors, 
         col = hmcol)

Q1. Is the clustering consistent with our different groups?

Now we can proceed with the differential expression testing. The DESeq function does all the hard work, it basically performs three steps:

  • Compute a scaling factor for each sample to account for differences in read depth and complexity between samples
  • Estimate the variance among samples
  • Test for differences in expression among groups

Differential expression

# DE test
dds <- DESeq(dds)
## estimating size factors
## estimating dispersions
## gene-wise dispersion estimates
## mean-dispersion relationship
## final dispersion estimates
## fitting model and testing
colnames(dds) <- sampleNames
head(dds)
## class: DESeqDataSet 
## dim: 6 8 
## metadata(1): version
## assays(4): counts mu H cooks
## rownames(6): ENSG00000000003 ENSG00000000419 ... ENSG00000000938
##   ENSG00000000971
## rowData names(22): baseMean baseVar ... deviance maxCooks
## colnames(8): Ctrl1 Ctrl2 ... VL2 VL3
## colData names(2): condition sizeFactor
# Extracting the results
res <- results(dds)
head(res)
## log2 fold change (MLE): condition VL vs Ctrl 
## Wald test p-value: condition VL vs Ctrl 
## DataFrame with 6 rows and 6 columns
##                    baseMean log2FoldChange     lfcSE      stat     pvalue
##                   <numeric>      <numeric> <numeric> <numeric>  <numeric>
## ENSG00000000003     9.14582     -1.8138943  0.637905 -2.843516 0.00446188
## ENSG00000000419   374.42530      0.5802608  0.279446  2.076470 0.03785052
## ENSG00000000457   770.53260     -0.0726711  0.200377 -0.362671 0.71685048
## ENSG00000000460   168.28316      0.6359161  0.269238  2.361909 0.01818110
## ENSG00000000938 23847.51068     -0.2688650  0.306979 -0.875843 0.38111551
## ENSG00000000971    62.98197      0.4261796  0.556256  0.766157 0.44358272
##                      padj
##                 <numeric>
## ENSG00000000003 0.0392143
## ENSG00000000419 0.1601584
## ENSG00000000457 0.8627233
## ENSG00000000460 0.1004275
## ENSG00000000938 0.6296845
## ENSG00000000971 0.6831105

Q2. How many genes have you tested? use nrow(res)

nrow(res)
## [1] 26171

You can create a histogram of the padjusted and an MA-plot to inspect if there is any enrichment of DE genes:

# Padjusted histogram
hist(res$padj, 
     breaks = 100, 
     col    = "green", 
     border = "darkgreen", 
     main   = "padjusted distribution", 
     xlab   = "padjusted")

# MA-plot
plotMA(res, main = "VL_vs_Ctrl")

Q3. What does the histogram tells you? What are the highest and lowest fold changes (approx) found in our dataset?

min(res$log2FoldChange)
## [1] -7.525772
max(res$log2FoldChange)
## [1] 7.732408

There are genes that have very low counts across all the samples, these can’t be tested for DE so they end up with an NA p-adj. Let’s remove them:

# checking how many genes have 
sum(is.na(res$padj))
## [1] 6660
# removing the NA padj
res <- res[ !is.na(res$padj), ]

Q4. How many genes were removed?

nrow(res)
## [1] 19511

We can now explore our data a little bit more. We can focus on genes that are significant according to some criteria: false discovery rate (FDR) or log fold change. For example, we can filter the results so that 1% are expected to be false positives (genes with no actual difference in expression):

alfa = 0.05
res_sig <- res[ which(res$padj < alfa), ]
res_sig
## log2 fold change (MLE): condition VL vs Ctrl 
## Wald test p-value: condition VL vs Ctrl 
## DataFrame with 2447 rows and 6 columns
##                   baseMean log2FoldChange     lfcSE      stat      pvalue
##                  <numeric>      <numeric> <numeric> <numeric>   <numeric>
## ENSG00000000003    9.14582      -1.813894  0.637905  -2.84352 4.46188e-03
## ENSG00000001084  684.88455       0.907674  0.212682   4.26776 1.97450e-05
## ENSG00000001497  688.87295      -0.771775  0.185793  -4.15395 3.26782e-05
## ENSG00000004399 1987.64309      -1.156232  0.375515  -3.07905 2.07658e-03
## ENSG00000004468 2498.66712       1.823404  0.372718   4.89219 9.97213e-07
## ...                    ...            ...       ...       ...         ...
## ENSG00000272620   10.53923       -2.20575  0.757704  -2.91110 3.60154e-03
## ENSG00000272742    5.09606        4.40812  1.170485   3.76606 1.65844e-04
## ENSG00000272763   17.72455        2.38666  0.561591   4.24982 2.13938e-05
## ENSG00000273149  899.49876        2.67861  0.438745   6.10515 1.02703e-09
## ENSG00000273442    2.66793       -4.58895  1.600247  -2.86765 4.13531e-03
##                        padj
##                   <numeric>
## ENSG00000000003 3.92143e-02
## ENSG00000001084 7.16238e-04
## ENSG00000001497 1.07882e-03
## ENSG00000004399 2.33523e-02
## ENSG00000004468 6.66323e-05
## ...                     ...
## ENSG00000272620 3.37997e-02
## ENSG00000272742 3.62756e-03
## ENSG00000272763 7.58936e-04
## ENSG00000273149 1.94547e-07
## ENSG00000273442 3.70961e-02
datatable(as.matrix(res_sig))

Q5. How many significant genes do you have?

nrow(res)
## [1] 19511

To see the top 20 most upregulated genes:

top = 20
print(as.matrix(res_sig[ order(res_sig$log2FoldChange, decreasing = TRUE),][1:top,]))
##                     baseMean log2FoldChange     lfcSE     stat       pvalue
## ENSG00000174358   127.930347       7.732408 1.3790641 5.606997 2.058675e-08
## ENSG00000204644    26.715322       7.418433 2.2926147 3.235796 1.213041e-03
## ENSG00000178752   343.840573       6.790581 0.7617054 8.914970 4.879473e-19
## ENSG00000268460    14.599232       6.536879 1.2171300 5.370732 7.841764e-08
## ENSG00000084734     6.307033       6.501329 1.2245724 5.309060 1.101919e-07
## ENSG00000251095     8.936396       6.404727 1.3896083 4.609016 4.045787e-06
## ENSG00000124749     5.803896       6.388858 1.5923904 4.012118 6.017648e-05
## ENSG00000206178     8.390654       6.310398 1.4684681 4.297266 1.729179e-05
## ENSG00000119630    22.893879       6.209982 1.1969620 5.188119 2.124284e-07
## ENSG00000237568    11.507078       6.184514 1.2234873 5.054825 4.307860e-07
## ENSG00000196565 36493.660290       6.094282 1.1513928 5.292965 1.203489e-07
## ENSG00000133742  5093.855886       6.085722 0.6552484 9.287656 1.577231e-20
## ENSG00000182704     6.053909       5.836124 1.7722503 3.293059 9.910378e-04
## ENSG00000224091     5.961097       5.806018 1.3013471 4.461545 8.137103e-06
## ENSG00000217416     3.846914       5.784767 1.4933008 3.873813 1.071458e-04
## ENSG00000073754     5.617134       5.729365 1.2323434 4.649163 3.332844e-06
## ENSG00000148734     7.916617       5.639529 1.3492045 4.179892 2.916475e-05
## ENSG00000087085   213.085623       5.525337 0.7295231 7.573902 3.621765e-14
## ENSG00000248227     3.061367       5.468280 1.5903005 3.438520 5.849037e-04
## ENSG00000230601     3.056478       5.452693 1.5320956 3.558977 3.723024e-04
##                         padj
## ENSG00000174358 2.608235e-06
## ENSG00000204644 1.600246e-02
## ENSG00000178752 1.586723e-15
## ENSG00000268460 7.886632e-06
## ENSG00000084734 1.048758e-05
## ENSG00000251095 2.055660e-04
## ENSG00000124749 1.741993e-03
## ENSG00000206178 6.563815e-04
## ENSG00000119630 1.786505e-05
## ENSG00000237568 3.362026e-05
## ENSG00000196565 1.139867e-05
## ENSG00000133742 7.693338e-17
## ENSG00000182704 1.385110e-02
## ENSG00000224091 3.641354e-04
## ENSG00000217416 2.652948e-03
## ENSG00000073754 1.748041e-04
## ENSG00000148734 9.844870e-04
## ENSG00000087085 2.826570e-11
## ENSG00000248227 9.517978e-03
## ENSG00000230601 6.795129e-03
kbl(as.matrix(res_sig[ order(res_sig$log2FoldChange, decreasing = TRUE),][1:top,]), 
    caption = "VL_vs_Ctrl Significant DE genes") %>%
        kable_classic(full_width = F, html_font = "Cambria")
VL_vs_Ctrl Significant DE genes
baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000174358 127.930347 7.732408 1.3790641 5.606997 0.0000000 0.0000026
ENSG00000204644 26.715322 7.418433 2.2926147 3.235796 0.0012130 0.0160025
ENSG00000178752 343.840573 6.790581 0.7617054 8.914970 0.0000000 0.0000000
ENSG00000268460 14.599232 6.536879 1.2171300 5.370732 0.0000001 0.0000079
ENSG00000084734 6.307033 6.501329 1.2245724 5.309060 0.0000001 0.0000105
ENSG00000251095 8.936396 6.404727 1.3896083 4.609016 0.0000040 0.0002056
ENSG00000124749 5.803896 6.388858 1.5923904 4.012118 0.0000602 0.0017420
ENSG00000206178 8.390654 6.310397 1.4684681 4.297266 0.0000173 0.0006564
ENSG00000119630 22.893879 6.209982 1.1969620 5.188119 0.0000002 0.0000179
ENSG00000237568 11.507078 6.184514 1.2234873 5.054825 0.0000004 0.0000336
ENSG00000196565 36493.660290 6.094282 1.1513928 5.292965 0.0000001 0.0000114
ENSG00000133742 5093.855886 6.085722 0.6552484 9.287656 0.0000000 0.0000000
ENSG00000182704 6.053909 5.836124 1.7722503 3.293059 0.0009910 0.0138511
ENSG00000224091 5.961097 5.806018 1.3013471 4.461544 0.0000081 0.0003641
ENSG00000217416 3.846914 5.784767 1.4933008 3.873813 0.0001071 0.0026529
ENSG00000073754 5.617134 5.729366 1.2323434 4.649163 0.0000033 0.0001748
ENSG00000148734 7.916617 5.639529 1.3492045 4.179892 0.0000292 0.0009845
ENSG00000087085 213.085623 5.525337 0.7295231 7.573902 0.0000000 0.0000000
ENSG00000248227 3.061367 5.468280 1.5903005 3.438520 0.0005849 0.0095180
ENSG00000230601 3.056478 5.452693 1.5320956 3.558977 0.0003723 0.0067951

Q6. Make a note of the genes. Change the decreasingargument to FALSE, what type of genes do you get? Make also a note of these genes.

kbl(as.matrix(res_sig[ order(res_sig$log2FoldChange, decreasing = FALSE),][1:top,]), 
    caption = "VL_vs_Ctrl Significant DE genes") %>%
        kable_classic(full_width = F, html_font = "Cambria")
VL_vs_Ctrl Significant DE genes
baseMean log2FoldChange lfcSE stat pvalue padj
ENSG00000142408 40.055401 -7.525772 1.3864284 -5.428172 0.0000001 0.0000062
ENSG00000105366 245.062705 -6.462171 1.4942735 -4.324624 0.0000153 0.0006010
ENSG00000138395 8.202745 -6.210153 1.4382885 -4.317738 0.0000158 0.0006090
ENSG00000170558 26.778054 -6.054004 1.2766064 -4.742263 0.0000021 0.0001195
ENSG00000103355 357.316605 -5.935067 1.2072165 -4.916323 0.0000009 0.0000604
ENSG00000122733 24.171931 -5.882075 1.3418652 -4.383507 0.0000117 0.0004879
ENSG00000215148 6.275105 -5.825145 1.4864657 -3.918789 0.0000890 0.0023183
ENSG00000205927 83.380742 -5.625245 1.4350377 -3.919929 0.0000886 0.0023133
ENSG00000261327 5.375587 -5.598636 1.4030454 -3.990346 0.0000660 0.0018619
ENSG00000146122 122.607798 -5.550336 0.9217390 -6.021591 0.0000000 0.0000003
ENSG00000239265 5.198719 -5.546517 1.9439918 -2.853159 0.0043287 0.0384690
ENSG00000130433 82.227780 -5.473095 0.9926391 -5.513680 0.0000000 0.0000041
ENSG00000130037 4.648468 -5.389323 1.3911242 -3.874077 0.0001070 0.0026529
ENSG00000260105 4.659556 -5.375267 1.4009836 -3.836781 0.0001247 0.0029940
ENSG00000254396 4.445459 -5.321777 1.4053036 -3.786924 0.0001525 0.0034324
ENSG00000225449 4.215827 -5.245733 1.7440426 -3.007801 0.0026315 0.0272807
ENSG00000161905 626.710813 -5.191873 1.4454512 -3.591870 0.0003283 0.0061534
ENSG00000106078 3.678201 -5.040158 1.4364829 -3.508679 0.0004503 0.0078451
ENSG00000187498 7.183593 -5.034032 1.3375390 -3.763653 0.0001674 0.0036545
ENSG00000168269 3.522239 -4.996943 1.5667296 -3.189410 0.0014256 0.0179654

It is also important to have a graphical representation of these results. A heat map of the significantly DE genes can be generated as follows:

# Selecting the rlog-transformed counts of the DE genes
sig2heatmap <- assay(rld)[ rownames( assay(rld) ) %in% rownames(res_sig), ]

# Calculating the difference of the counts to the mean value
sig2heatmap_Mean <- sig2heatmap - rowMeans(sig2heatmap)
colnames(sig2heatmap_Mean) <- sampleNames

# Plotting
pheatmap(sig2heatmap_Mean, 
         main              = paste("VL_vs_Ctrl Significant DE genes (", alfa, ")", sep=" "), 
         annotation_col    = annotation_col, 
         annotation_colors = annotation_colors, 
         show_rownames     = FALSE)

Q7. What can you conclude from the heatmap?

You can of course plot the top 20 most significant genes:

# Selecting the top 20 genes based on adjusted pvalue
res_top <-res_sig[ order(res_sig$padj), ][ 1:top, ]

# Selecting the rlog-transformed counts of the DE genes
top2heatmap <- assay(rld)[ rownames(assay(rld)) %in% rownames(res_top), ]
colnames(top2heatmap) <- sampleNames

# Calculating the difference of the counts to the mean value
top2heatmap_Mean <- top2heatmap - rowMeans(top2heatmap)
colnames(top2heatmap_Mean) <- sampleNames

# Plotting
pheatmap(top2heatmap_Mean, 
         main              = paste("VL_vs_Ctrl Significant DE genes (", alfa, "-", top, "genes )", sep = " "),
         annotation_col    = annotation_col, 
         annotation_colors = annotation_colors)

Q8. What can you conclude from this heatmap? Are the plotted genes in your top 20 most up/down regulated present in this list?

And all the genes at the same time:

EnhancedVolcano(res,                                         # data to plot
                lab            = rownames(res),              # labels (protein name)
                x              = "log2FoldChange",           # value to plot in the X-axis
                y              = "pvalue",                   # value to plot in the Y-axis
                pCutoff        = 0.001,                      # cut off for the pvalue
                FCcutoff       = 2,                          # cut off for the fold change
                title          = "VL_vs_Ctrl",               # Title of the plot
                subtitle       = NULL,                       # Remove subtitle to create more space for the plot
                caption        = NULL,                       # Remove caption to create more space for the plot
                legendPosition = "top",                      # Position the legend on top of the plot
                axisLabSize    = 11,                         # Set font size for axis labels
                labSize        = 2.8,                        # Set font size for protein labels
                )